BUS ARCHITECTURE EMPLOYING
VARYING WIDTH UNI-DIRECTIONAL
COMMAND BUS

CROSS-REFERENCE TO RELATED
APPLICATIONS

The present application is related to the following commonly
assigned co-pending applications, filed on the same date as
the present application, which are herein incorporated by
reference:
Ser. No. 09/439,189, to Robert A. Drehmel, et al., entitled
"Processor-Memory Bus Architecture for Supporting
Multiple Processors"; and
Ser. No. 09/439,586, to H. Lee Blackmon, et al., entitled
"Data Routing Using Status-Response Signals".

FIELD OF THE INVENTION 

The present invention relates to digital data processing,
and in particular to the design of high-speed communication
buses for linking processors, memory and other components
of a computer system.

BACKGROUND OF THE INVENTION 

A modern computer system typically comprises a central
processing unit (CPU) and supporting hardware necessary to
store, retrieve and transfer information, such as
communications buses and memory. It also includes hardware
necessary to communicate with the outside world, such as
input/output controllers or storage controllers, and devices
attached thereto such as keyboards, monitors, tape drives,
disk drives, communication lines coupled to a network, etc.
The CPU is the heart of the system. It executes the
instructions which comprise a computer program and directs the
operation of the other system components.

From the standpoint of the computer's hardware, most
systems operate in fundamentally the same manner. Processors
are capable of performing a limited set of very simple
operations, such as arithmetic, logical comparisons, and
movement of data from one location to another. But each
operation is performed very quickly. Programs which direct
a computer to perform massive numbers of these simple
operations give the illusion that the computer is doing
something sophisticated. What is perceived by the user as a
new or improved capability of a computer system is made
possible by performing essentially the same set of very
simple operations, but doing it much faster. Therefore
continuing improvements to computer systems require that
these systems be made ever faster.

The overall speed of a computer system (also called the
throughput) may be crudely measured as the number of
operations performed per unit of time. Conceptually, the
simplest of all possible improvements to system speed is to
increase the clock speeds of the various components, and
particularly the clock speed of the processor(s). E.g., if
everything runs twice as fast but otherwise works in exactly
the same manner, the system will perform a given task in
half the time. Early computer processors, which were
constructed from many discrete components, were susceptible
to significant speed improvements by shrinking component
size, reducing component number, and eventually, packaging
the entire processor as an integrated circuit on a single
chip. The reduced size made it possible to increase clock
speed of the processor, and accordingly increase system
speed.

Despite the enormous improvement in speed obtained
from integrated circuitry, the demand for ever faster
computer systems has continued. Hardware designers have been
able to obtain still further improvements in speed by greater
integration (i.e., increasing the number of circuits packed
onto a single chip), by further reducing the size of circuits,
and by various other techniques. However, designers can see
that physical size reductions cannot continue indefinitely,
and there are limits to their ability to continue to increase
clock speeds of processors. Attention has therefore been
directed to other approaches for further improvements in
overall speed of the computer system.

Without changing the clock speed, it is possible to 
improve system throughput by using multiple processors. 
The modest cost of individual processors packaged on 
integrated circuit chips has made this approach practical. 
However, one does not simply double a system's throughput 
by going from one processor to two. The introduction of 
multiple processors to a system creates numerous architec- 
tural problems. For example, the multiple processors will 
typically share the same main memory (although each 
processor may have its own cache). It is therefore necessary 
to devise mechanisms that avoid memory access conflicts, 
and assure that extra copies of data in caches are tracked in 
a coherent fashion. Furthermore, each processor puts addi- 
tional demands on the other components of the system such 
as storage, I/O, memory, and particularly, the communica- 
tions buses that connect various components. As more 
processors are introduced, there is greater likelihood that 
processors will spend significant time waiting for some 
resource being used by another processor. 

All of these issues and more are known by system 
designers, and have been addressed in one form or another. 
While perfect solutions are not available, improvements in 
this field continue to be made. 

Of particular interest herein is the design of communica- 
tions buses. In simple computer systems, all major compo- 
nents such as processor, memory, storage controllers, and 
I/O are connected on a single multi-drop communications 
bus. Physically, such a multidrop bus is a common set of 
parallel conductors, and each component is connected to
these conductors through logic drivers or gates. The
architecture of such a bus permits any arbitrary component
connected to the bus to send data to any other arbitrary 
component connected to the bus, although it is not neces- 
sarily the case that all possible combinations are actually 
used. Since only one component may send data at any time, 
the component sending data must first obtain control of the 
bus, a process known as arbitration. The bus typically has an 
address portion for specifying the receiving device(s), and a 
data portion for specifying the data being transferred. It may 
also have various control lines. 

The clock speed at which a multi-drop bus operates is 
limited by the number of attached devices, their physical 
configuration with respect to one another, the speed at which 
individual devices operate, and other factors. For this 
reason, many computer systems have multiple buses. In 
particular, processors and memory may be coupled to a 
relatively high-speed bus, while storage and I/O devices 
may be coupled to a slower bus. Since processors and 
memory typically require a higher speed, and are physically 
close enough to support higher speed bus operation, isola- 
tion of processors and memory from the lower speed devices 
such as storage and I/O by using a special processor-memory 
bus supports bus operation at higher speed and improves 
system performance. 

However, the demand for increased system throughput
continues. It is desirable to increase the number of
processors in a computer system to increase system throughput.
However, the high-speed multi-drop processor-memory bus
was intended for a relatively small number of attached
components. As the number of processors attached to such
a bus increases, it becomes difficult or impossible to operate
the bus at the higher clock speeds necessary to support
communication among the various components. Moreover,
the simple creation of wider buses or of additional (parallel)
buses is not always a practical solution. Wider or additional
buses mean that each processor must have additional I/O
pins, where the number of I/O pins is already extremely
constrained.

Some designers have attempted to address this problem 
using hierarchical buses, in which each processor is assigned 
to a node, all processors within a node being on the same 
local bus coupled to a node controller, wherein the node 
controller handles communications with devices in other 
nodes through a separate remote bus. However, these 
designs require a great deal of complexity on the part of the 
node controller, with attendant cost and collateral issues. 

There is a need for an alternative high-speed communi- 
cation path architecture in a computer system for supporting 
communication among larger numbers of processors and 
memory. 

SUMMARY OF THE INVENTION 

It is, therefore, an object of the present invention to 
provide an enhanced multi-processor computer system. 

Another object of this invention is to provide an enhanced
processor-to-memory communication path for a
multiprocessor computer system.

Another object of this invention is to support an increased 
number of processors in a multiprocessor computer system. 

Another object of this invention is to reduce the number 
of I/O pins and other hardware required to support commu- 
nications in a multiprocessor system. 

An internal communication network for supporting data 
communication among multiple processors and memory 
within a computer system comprises a command portion for 
transmitting addresses and commands, having a unidirec- 
tional input bus portion for transmitting commands to a 
central command repeater unit, and a unidirectional broad- 
cast bus portion for broadcasting commands from the central 
command repeater unit. The input portion comprises a 
plurality of links running from different devices, wherein 
each link is less than the full width of the broadcast bus
portion. A command is transmitted over the input portion in 
a plurality of bus cycles, and broadcast over the broadcast 
portion in a single bus cycle. Since multiple input links 
connect to the central command repeater, it is possible to 
keep the broadcast bus full notwithstanding the fact that 
multiple bus cycles are required to transmit an individual 
command on the input portion. 
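The claim that several narrow input links can keep a full-width broadcast bus busy can be checked with a small simulation. The sketch below is illustrative only: the function name `simulate` and its parameters are hypothetical, every link is assumed to be permanently busy, and the arbitration described in this document is replaced by a simple queue at the repeater.

```python
from collections import deque


def simulate(num_links, cycles_per_command, total_cycles):
    """Count commands broadcast in total_cycles bus cycles.

    Assumptions (not from the source): every input link always has a
    command to send, each command needs cycles_per_command cycles to
    arrive at the repeater, and the repeater broadcasts at most one
    queued command per cycle.
    """
    # Stagger the links so they complete on different cycles.
    progress = [i % cycles_per_command for i in range(num_links)]
    pending = deque()
    broadcasts = 0
    for _ in range(total_cycles):
        for link in range(num_links):
            progress[link] += 1
            if progress[link] == cycles_per_command:
                pending.append(link)  # a whole command has arrived
                progress[link] = 0
        if pending:
            pending.popleft()  # one full-width broadcast this cycle
            broadcasts += 1
    return broadcasts
```

Under these assumptions, a single half-width link (`cycles_per_command=2`) fills the broadcast bus only every other cycle, while two or more staggered half-width links allow a broadcast on every cycle.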

In the preferred embodiment, the links are arranged 
hierarchically. A series of unidirectional links runs between 
processors and local address repeater units (ARPs), and 
between the ARPs and the central command repeater, called 
an address switch unit (ASW), all of these links being 
half-width. A data transfer command propagates from a 
requesting device to its local ARP, to the ASW, requiring two 
bus cycles for each stage. From the ASW, the command is 
broadcast to all component devices on the network in a 
single bus cycle by transmitting to all ARPs or directly 
attached memory on a separate set of full-width unidirec- 
tional links. The ARPs then repeat the transmission to all
attached processors or other units on another set of
unidirectional links.

In the preferred embodiment, the ASW globally arbitrates
the command bus. A request by a processor to transmit an
address to the ARP must be granted first by the ASW. Once
granted, the command will propagate in a pre-defined number
of clock cycles to the ASW through the ARP, and thus
addresses are not buffered in the ARP (although they are held
in a register for a small number of cycles during
re-propagation). The command is then broadcast, again at
pre-defined bus cycles from initial bus grant.

In the preferred embodiment, addresses/commands and
data are transmitted on essentially separate paths having
different topologies, and at different times, and are arbitrated
separately. The data portion of the network comprises a set
of bidirectional links from the processors to a local data
switch unit (DSW). The local DSW is further linked directly
to memory via bi-directional links. In fact, the data portion
of the network contains multiple independent data paths
supporting multiple simultaneous data transfers, all of which
are supported by a single logical hierarchical address bus
portion. No address is transmitted with the data; rather, a tag
is transmitted which identifies the command with which the 
data is associated. 
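A minimal sketch of how a requesting device might match returning data to an earlier command by tag follows. The class `Requester`, its method names, and the sequential tag allocation are all hypothetical; the source states only that a tag, rather than an address, accompanies the data.

```python
class Requester:
    """Tracks outstanding commands so data can be matched by tag.

    Hypothetical illustration: tag width and allocation policy are
    not specified in the source.
    """

    def __init__(self):
        self.outstanding = {}  # tag -> address of the pending command
        self.next_tag = 0

    def issue_read(self, address):
        """Record a read command and return the tag sent with it."""
        tag = self.next_tag
        self.next_tag += 1
        self.outstanding[tag] = address
        return tag

    def receive_data(self, tag, data):
        """Match arriving data to its command using only the tag."""
        address = self.outstanding.pop(tag)
        return address, data
```

Because the lookup is keyed on the tag alone, data for different commands may return in any order over the independent data paths and still be associated with the correct address.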

Consistent with commonly understood terminology, the
network is herein referred to collectively as a "bus" or
"memory bus" (the latter to distinguish it from I/O buses).
The portion of the network which transmits addresses and
commands is sometimes referred to as the "address bus",
while the data portion is referred to as the "data bus", and
other portions of the network are similarly designated. It will
be understood that the communications network described
herein is not physically the same as a classical multi-drop
bus, although it performs the analogous function.

The bus described herein has characteristics of a
pipeline, wherein high clock speed (and high throughput) is
achieved by staging bus operations over a number of cycles.
The full-width broadcast bus is kept full because multiple
commands can be received by the central repeater in an
overlapping fashion on the multiple input links. The use of
multiple hierarchical links enables the entire memory bus to
operate at high clock speed. Furthermore, because the
repeaters and ASW do not buffer data or determine
destination through complex directories, the design is greatly
simplified vis-a-vis typical prior art hierarchical designs. All
of these considerations make it possible to support a
relatively large number of processors at a high throughput.

The details of the present invention, both as to its structure
and operation, can best be understood in reference to the
accompanying drawings, in which like reference numerals
refer to like parts, and in which:

BRIEF DESCRIPTION OF THE DRAWING 

FIG. 1 shows the major hardware components of a
multiprocessor computer system for utilizing the high speed
memory bus architecture according to the preferred
embodiment of the present invention.

FIG. 2 shows in greater detail the overall topology of the
memory bus, according to the preferred embodiment.

FIG. 3 shows in greater detail the components of the
address bus portion of the memory bus, according to the
preferred embodiment.

FIG. 4 shows in greater detail the components of the
status bus portion of the memory bus, according to the
preferred embodiment.

FIG. 5 shows in greater detail the components of the
response bus portion of the memory bus, according to the
preferred embodiment.

FIG. 6 shows in greater detail the data bus portion of the
memory bus, in accordance with the preferred embodiment.

FIG. 7 shows in greater detail the components which
support arbitration of the memory bus, according to the
preferred embodiment.

FIG. 8 is a timing diagram of bus activity for a
command-only bus transmission, according to the preferred
embodiment.

FIG. 9 is a timing diagram of bus activity for a
read-from-memory bus transmission, according to the preferred
embodiment.

FIG. 10 is a timing diagram of bus activity for a
write-to-memory bus transmission, according to the preferred
embodiment.

FIG. 11 is a timing diagram of bus activity for a
read-from-device, where the data source and bus master are in the
same local node but not on the same local data bus,
according to the preferred embodiment.

FIG. 12 is a timing diagram of bus activity for a
read-from-device, where the data source and bus master are in
different local nodes, according to the preferred embodiment.

FIG. 13 is a timing diagram of bus activity for a
read-from-device, where the data source and bus master are on
the same local bus, according to the preferred embodiment.

FIG. 14 is a timing diagram of bus activity for a
write-to-device, according to the preferred embodiment.

FIG. 15 illustrates a processor chip and its I/O pin
requirements at a high level, according to the preferred
embodiment.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENT

The major hardware components of a multiprocessor
computer system 100 for utilizing the high speed memory
bus architecture according to the preferred embodiment of
the present invention are shown in FIG. 1. Multiple central
processing units (CPUs) 101A, 101B and 101C perform
basic machine processing function on instructions and data
from main memory 102. Each processor contains or controls
a respective cache. In the preferred embodiment, each
processor has a separate on-chip L1 instruction cache, an
on-chip L1 data cache, and an on-chip L2 cache
directory/controller, the L2 cache itself being in a separate chip.
However, these cache structures are shown conceptually in
FIG. 1 as a single block 106A, 106B and 106C for each
respective processor. For purposes of this invention, the
precise implementation details of caching in each processor
are not significant, and the caches could be implemented
differently.

I/O bus interface units 105A, 105B communicate with
multiple I/O processing units (IOPs) 111-117 through
respective system I/O buses 110A, 110B. In the preferred
embodiment, each system I/O bus is an industry standard
PCI bus. The IOPs support communication with a variety of
storage and I/O devices, such as direct access storage
devices (DASD), tape drives, workstations, printers, and
remote communications lines for communication with
remote devices or other computer systems. While
three CPUs, two I/O bus interface units and I/O buses, and
various numbers of IOPs and other devices are shown in
FIG. 1, it should be understood that FIG. 1 is intended only
as an illustration of the possible types of devices that may be
supported, and the actual number and configuration of
CPUs, buses, and various other units may vary. For
simplicity, CPUs, bus interface units, and system I/O buses
are herein designated generically by reference numbers 101,
105 and 110, respectively.

In FIG. 1, feature 103 conceptually represents a memory
bus for communicating among the various processors 101,
memory 102 and I/O interface units 105. Clearly, some form
of communication path is required for multi-processor
computer system 100 to function. This function could
theoretically be performed (although not necessarily at acceptable
performance levels) using a single multi-drop bus connected
to all the processors, memory, and I/O interface units. The
present invention is directed toward providing essentially
the same function at improved performance.

In the preferred embodiment, computer system 100 is an
[illegible] multiprocessor computer system, it being
understood that the present invention could be implemented
on other multiprocessor computer systems.

FIG. 2 shows in greater detail the overall topology of
memory bus 103. Processors 101A-101H are mounted on
electronic circuit cards 201A-201D (herein generically
referred to as reference number 201), each card defining a
separate local node. Each card 201 can hold up to eight
processors 101, although cards will not necessarily always
have the architectural maximum number of processors. Each
card 201 also contains an address repeater unit (ARP) 210
and a data switch unit (DSW) 211. In the preferred
embodiment, data switch unit 211 is implemented on two
chips (hence the depiction in FIG. 2), but it acts as a single
functional unit, as described herein. Local address bus links
(i.e., contained entirely within card 201) connect each
processor 101 with the ARP unit 210 in its card, while local data
bus links connect each processor 101 with the DSW unit 211
in its card. These are described more fully below.

In addition to one or more processor cards 201, each
containing multiple processors, the system will typically
contain one or more I/O cards 202. Each I/O card is also a
separate local node in the memory bus network, and contains
up to four I/O bus interface units 105A-105D, each unit
communicating (through adapter hardware, not shown) with
a respective PCI bus 110A-110D. As in the case of processor
cards, each I/O card 202 will not necessarily contain the
architected maximum number of interface units 105. Each
I/O card 202 similarly contains an ARP unit 210 and a DSW
unit 211, the I/O bus interface units 105 being coupled to
ARP 210 by local address bus links, and being coupled to
DSW 211 by local data bus links, as described more fully
herein.

Main memory 102 is contained in one or two memory
subsystems 203A-203B (herein generically referred to as
reference number 203), each subsystem having separate
physical packaging (i.e., each subsystem may be on one or
more separate cards or assemblies of cards). Each subsystem
is also another local node from the perspective of the
memory bus, although the memory subsystem nodes
function somewhat differently from processor and I/O interface
nodes. The memory controller card contains a memory
address interface (MAI) unit 220 and a memory data
interface (MDI) unit 221 (the latter being implemented in the
preferred embodiment and depicted in FIG. 2 as four chips,
but functionally a single unit). The memory address
interface unit 220 receives addresses on the memory bus and
routes address signals to physical memory. Data interface
unit 221 handles communication of data between the
memory bus and physical memory devices. The two
memory subsystems are configured so that the identity of the
memory subsystem storing a particular data address is
derivable from the address itself, e.g., by taking a single bit
of the address. This assists data routing for various bus
operations.

All ARP units 210 and memory address interface units
220 are coupled via remote links to a single central address
repeater unit (also referred to as the address switch, or ASW)
212 (depicted in FIG. 2 as two chips, but functionally a
single unit). These links are "remote" in the sense that they
go from card to card (intercard links). ASW 212
propagates addresses among local nodes and handles global
memory bus arbitration.

A global clock signal 215 is distributed to all nodes and
to ASW 212. All memory bus activity is synchronized to this
global signal.

FIG. 3 shows in greater detail the components of the
address bus portion of memory bus 103. The address bus
portion transmits a destination address, a command code, a
tag, and some miscellaneous bits, and although it contains
information other than the address, it is designated simply
the "address bus" for convenience. Each processor card 201
contains two unidirectional local address request buses
301A-301B (herein generically referred to as reference
number 301) running from processors 101 to ARP 210, each
local address request bus 301 supporting up to four
processors. Each local address bus 301 is physically a shared set of
parallel lines, the number of lines being one-half that
required to transmit a complete address along with
associated command, tag, parity and other control information.
I.e., unidirectional local address request buses 301 are
half-width, two bus cycles being required to transfer a
complete address and associated information. Since the bus
is shared, only one processor can transmit in any given
cycle. The I/O card 202 is similar, but each local address bus
301 supports two I/O interface units 105.

A single unidirectional remote address request bus 302A
(herein generically referred to as reference number 302) runs
from ARP 210 to ASW 212. There is a separate bus 302 from
each ARP 210 to ASW 212 (including ARP 210 in I/O node
card 202), requiring a separate port (separate set of input
pins) for each bus 302. Like bus 301, each bus 302 is
half-width, and requires two bus cycles to transmit a full
address along with associated control information.

ARP 210 has no buffering capability. The addresses
received on buses 301 are held in repeater registers for two
cycles, but ARP 210 cannot otherwise hold address
information received on bus 301. Therefore, only one of buses
301 within a card 201 can transmit to the ARP 210 at any one
time. ASW 212 grants the right to transmit to a processor
101 (or I/O interface unit 105), and ARP 210 simply gates
the appropriate one of buses 301 into its repeater registers
for transmission to ASW 212 on remote address request bus
302.

A single respective unidirectional remote address
broadcast bus 303A (herein referred to generically as reference
number 303) runs from ASW 212 to each ARP 210. This bus
is used to transmit all addresses received by ASW 212 to the
various local nodes, i.e., address requests received by ASW
212 are broadcast to all the local nodes. Unlike request bus
302, broadcast bus 303 is full width, capable of transmitting
the entire address and associated control information in a
single bus cycle.

A set of four unidirectional local address broadcast buses
304A-304D (herein generically referred to as reference
number 304) runs from ARP 210 to the processors 101 (or,
in the case of an I/O card 202, to I/O bus interface units 105,
not shown). These buses are used to broadcast addresses
received by ARP 210 from ASW 212 on bus 303. ARP 210
makes no decisions as to which processor(s) or bus interface
unit(s) will receive the addresses; it simply broadcasts any
address coming in on bus 303 to all devices in its local node
(card). Since remote broadcast bus 303 is full width, it
follows that local broadcast bus 304 is also full
width. While connections to only one node 201 are shown in
FIG. 3, it will be understood that these connections are
repeated for each node, and are omitted
from the figure for clarity of illustration.

A single unidirectional memory address broadcast bus
305 runs directly from ASW 212 to memory address
interface units 220 in each of the memory subsystems 203. Bus
305 acts like bus 303, broadcasting addresses and associated
control information to the memory subsystems. Memory
address interface units 220 further propagate this
information within the memory subsystems 203, but for purposes of
understanding memory bus 103, it is not necessary to
describe the memory subsystem architecture in detail here.
Bus 305 is also a full width bus, carrying all the address and
associated information in a single bus cycle.

It will be observed that ASW 212 contains five ports
for receiving up to five half-width remote address
request buses 302 from up to five separate nodes 201 or 202.
ASW 212 can receive addresses from more than one node
simultaneously. In fact, it is intended that it do so. Addresses
on buses 302 are half-width, being received in two
separate clock cycles, while addresses transmitted out on
buses 303 and 305 are full width, being transmitted in a
single cycle. In order to utilize buses 303 and 305 to full
capacity, addresses from multiple nodes are received in
overlapping concurrent fashion by ASW 212 on the
half-width buses 302.

The use of buses 301 and 302 of half-width rather than full
width saves precious I/O pins on the processor, ARP and
ASW chips. In theory, if one assumes that all nodes make an
equal number of requests, the address request buses could be
1/N the width of the broadcast bus 303, where N is the
number of nodes which can request use of the memory bus.
In the present architecture, a half-width (N=2) is chosen,
because the number of nodes may vary depending on
customer configuration and it is difficult to guarantee that nodes
will submit an equal number of requests. However, the bus
width can vary, and in particular, if the present architectural
concepts are applied to a system having a larger number of
nodes and processors, request buses of 1/3 width, 1/4 width, or
smaller may be appropriate.

FIG. 4 shows in greater detail the components of the
status bus portion of memory bus 103. The status bus
comprises multiple unidirectional point-to-point links
running directly between processors 101 or I/O bus interface
units 105 and ASW 212. I.e., these links by-pass ARP 210.
Particularly, a separate unidirectional send status link
401A-401B (herein referred to generically as reference
number 401) runs from each processor 101 (or I/O bus
interface unit 105, not shown) to a respective input port on
ASW 212. A separate unidirectional receive status link
402A-402B (herein generically referred to as reference
number 402) runs from a respective output port on ASW 212
to each respective processor 101 (or I/O bus interface unit
105, not shown). Each unidirectional point-to-point link 401
or 402 is physically two parallel lines (not including a parity
line, which is a combined parity line for the status and
response buses). While connections to only two processors
are shown in FIG. 4, it will be understood that these
connections are repeated, the multiple connections being
omitted from the figure for clarity of illustration.

Additionally, a separate point-to-point send status link
403A-403B (herein generically referred to as reference 
number 403) runs from each memory subsystem 203 to a 
respective port on ASW 212. A separate point-to-point link 
404A-404B (herein generically referred to as reference 
number 404) runs from a respective port on ASW 212 to 
each memory subsystem 203, links 403 and 404 containing 
the same two parallel lines as links 401 and 402. 

The two lines of the status bus are sufficient to specify one 
of four possible status states, each having an associated 
priority, from Table 1 below: 



TABLE 1

Priority    Status
1           Address Parity Error
2           Retry
3           Address Acknowledge
4           Null (not decoded, no parity error, no retry)



The lower number indicates a higher priority. The ASW 
collects status received from all devices, and broadcasts the 
highest priority status received. Thus, an address parity 
error, indicating a severe hardware fault, has the highest 
priority and will automatically be broadcast in preference to 
any other response. This will usually abort processing and 
trigger some machine check condition. A retry indicates that 
a device has not had time to determine whether the com- 
mand is intended for it, e.g., a processor has not had time to 
look in its cache to determine whether it has a copy of the 
data. Because the bus architecture operates on a pipeline 
paradigm, all devices must respond on a pre-defined bus 
cycle, whether they have had time to determine the correct 
response or not. Understandably, in some cases a processor 
may be so busy with other high priority tasks that it can not 
make the required determination, but the need to respond 
dictates that it request a retry of the command. An address 
acknowledge is the response when a device determines that 
the command is applicable to it, e.g., it has a copy of the 
requested data, although not necessarily the only copy. A 
device responds with null if the command is not applicable 
to the responding device. 
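The collection rule described above (the ASW broadcasts the highest-priority status it receives) can be sketched as a minimum over priority numbers. The names `Status` and `combine_status` are hypothetical, and the enum values are the priorities from Table 1, not the two-line signal encodings, which are not given here.

```python
from enum import IntEnum


class Status(IntEnum):
    """Status states from Table 1; the value is the priority (1 is highest)."""
    ADDRESS_PARITY_ERROR = 1
    RETRY = 2
    ADDRESS_ACKNOWLEDGE = 3
    NULL = 4


def combine_status(statuses):
    """Return the highest-priority (lowest-numbered) status received.

    Assumption: if no device reports anything, the result is NULL.
    """
    return min(statuses, default=Status.NULL)
```

For example, if one device acknowledges the address while another requests a retry, `combine_status` yields `Status.RETRY`, so the command is retried rather than acted upon.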

FIG. 5 shows in greater detail the components of the 
response bus portion of memory bus 103. The response bus 
is similar to the status bus, and comprises multiple unidi- 
rectional point-to-point links running directly between pro- 
cessors 101 or I/O bus interfaces units 105 and ASW 212, 
by-passing ARP 210. A separate unidirectional send 
response link 501A-501B (herein generically referred to as 
reference number 501) runs from each processor 101 (or I/O 
bus interface unit 105, not shown) to a respective input port 
on ASW 212. A separate unidirectional receive response link 
502A-502B (herein generically referred to as reference 
number 502) runs from a respective output port on ASW 212 
to each respective processor 101 (or I/O bus interface unit 
105, not shown). Each unidirectional point-to-point link 501 
or 502 is physically three parallel lines (not including 
parity). While connections to only two processors are shown 
in FIG. 5, it will be understood that these connections are 
repeated, the multiple connections being omitted from the 
figure for clarity of illustration. Additionally, a separate 
point-to-point receive response link 504A-504B (herein 
generically referred to as reference number 504) runs from 
the ASW to each memory subsystem 203, also containing 
three lines. There is no send response link running from the 
memory subsystems to the ASW, since the memory subsystems
generate only status, not responses.

The three lines of the response bus are sufficient to specify
one of up to eight responses, each having an associated
priority (lower number being highest). However, not all
eight possibilities are used. The responses used in the
preferred embodiment are listed in Table 2 below:

TABLE 2

Priority    Response
1           Retry
2           Modified-Intervention
3           Shared-Intervention
4           Rerun
5           Shared
6           Null or Clean (Not Modified or Shared)

The ASW collects responses received from all devices, and
broadcasts the highest priority response received. A retry
indicates that the device may have modified data, but has not
had time to determine this. Modified-intervention means that
the responding device has modified data in its cache; this has
a relatively high priority, because only the modified data
should be used. Shared-intervention means that the responding
device has a shared but unmodified copy of data in its
cache, which should be used instead of the copy from
memory. The shared-intervention response is optionally
enabled, depending upon system configuration, and location
of the responding device relative to the bus master. The rerun
response is similar to retry, but has a lower priority. The
shared response indicates that the responding device has a
shared (unmodified) copy of the data, but it will not intervene
and send a copy to the bus master. Null indicates that
the responding device does not have a copy of the data.
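
The combining rule applied by the ASW can be illustrated with a minimal sketch. The names `Response` and `combine_responses` are hypothetical, and the enum values simply mirror the priority numbers of Table 2; the actual 3-bit line encoding is not given in the text and is not modeled here. Treating an empty set of responders as Null is likewise an assumption.

```python
from enum import IntEnum


class Response(IntEnum):
    # Values are the Table 2 priority numbers (1 = highest priority),
    # NOT the 3-bit wire encoding, which the text does not specify.
    RETRY = 1
    MODIFIED_INTERVENTION = 2
    SHARED_INTERVENTION = 3
    RERUN = 4
    SHARED = 5
    NULL = 6


def combine_responses(responses):
    """Return the highest-priority (lowest-numbered) response received."""
    # Assumption: if no device responds, the combined response is Null.
    return min(responses, default=Response.NULL)
```

Under this sketch, a single Modified-Intervention response overrides any number of Shared or Null responses, which matches the statement that only the modified data should be used.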

FIG. 6 shows in greater detail the data bus portion of
memory bus 103. The data portion contains a set of
bidirectional parallel local data buses 601A-601H, running
between processors 101 (or I/O interface units 105, not
shown) and DSW 211. Because DSW is physically contained
on two chips, the high order bits of each bus 601A,
601C, 601E, 601G are coupled to one of the DSW chips,
while the low order bits of each bus 601B, 601D, 601F,
601H are coupled to the other. FIG. 6 illustrates the local
data buses as two separate buses designated 601A, 601C,
601E, 601G, respectively, and 601B, 601D, 601F, 601H,
respectively, but they are logically a single wide parallel bus,
sometimes hereinafter referred to as local data bus 601. Each
local data bus 601 connects a pair of processors 101 (or a
single I/O bus interface unit 105) with a respective port on
DSW 211. DSW 211 contains four bus ports, supporting
eight processors 101 (or four I/O bus interface units 105).

The data bus portion further contains a set of bidirectional
point-to-point remote data buses 602A-602D running
between DSW 211 and the memory subsystems 203. As in
the case of the local data buses, the high order bits 602A,
602C are coupled to one of the DSW chips, while the low
order bits 602B, 602D are coupled to the other. Because each
remote data bus running between a memory subsystem and
DSW is logically a single wide parallel bus, the pair 602A
and 602B or the pair 602C and 602D are sometimes herein
referred to as remote data bus 602.

There is no buffering of data in DSW 211. As in the case
of the ARP 210, DSW 211 acts as a repeater unit, holding
data in registers for two clock cycles when transmitting
between a local data bus and a remote data bus.

A separate remote data bus 602 runs from each DSW 211
to each memory subsystem 203. These buses can carry data
independently on the same bus cycle. DSW 211 can direct
data from any arbitrary local bus to remote bus or vice-versa,
provided the buses are free. Thus it is possible for one of the
processors in a given node to be transmitting data to (or
receiving data from) memory subsystem 0 at the same time
that another processor in the same node transmits data to (or
receives data from) memory subsystem 1, provided,
however, that these two processors are on different local data
buses 601. This parallelism can be repeated for each node.
Since the system supports up to five nodes, it is possible for
ten data transfers to be occurring simultaneously.

FIG. 7 shows in greater detail the arbitration lines for
supporting arbitration in memory bus 103. A set of
unidirectional local address bus request (ABR) lines 701A-701D
(herein referred to generically as reference number 701) runs
from each processor 101 (or I/O bus interface unit 105) to a
respective port on the local ARP unit 210. Each is a single line
carrying a single electrical signal (bit), used by the processor
to request the address bus. A corresponding set of
unidirectional remote address bus request lines 702A-702B (herein
referred to generically as reference number 702) runs from
ARP 210 to ASW 212, i.e., for each line 701, there is one and
only one line 702. ARP 210 acts only as a repeater unit for
the local bus request lines 701, repeating any signal put on
a local bus request line 701 by its processor on the
corresponding remote address bus request (RDRV_ABR) line
702. Since there are potentially eight processors in a node,
each ARP receives up to eight lines 701, and drives up to
eight lines 702. The ASW has forty ports for receiving up to
forty (eight lines times five nodes) remote address bus
request lines 702.

ASW 212 grants the address bus by signaling on one of
the remote address bus grant (RDRV_ABG) links 703A
(herein referred to generically as reference number 703).
Each link 703 is a unidirectional 4-bit bus running from
ASW 212 to the local ARP unit 210. Although multiple
address bus requests might come in on the same cycle
(requiring a separate request line for each processor), the
ASW can only grant the bus to one processor at a time.
Therefore, there is no need for forty output ports on the ASW
supporting up to forty address bus grant lines 703. A 4-bit
bus to each ARP (total of 20 output lines) is sufficient to
specify the identifier of a single processor to which the
address bus is granted. Four bits are required because there
are nine possible outputs: a bus grant to any one of the eight
processors, or no bus grant at all.

The ARP decodes the bus grant signal received on remote
bus grant link 703. If a bus grant to one of its local
processors is specified, ARP signals the corresponding
processor on unidirectional local address bus grant line
704A-704D (herein referred to generically as reference
number 704). There is a separate local address bus grant line
704 running from the local ARP to each local processor
(total of eight lines). Each ARP unit thus has four input pins
for receiving its remote grant line 703, and eight output pins
for driving local grant lines 704. While FIG. 7 depicts local
lines 701 and 704 running to four processors in local node
201, it will be understood that local lines 701 and 704
connect to all processors, and further that additional remote
lines 702, 703 connect to all nodes, the additional lines being
omitted from the figure for clarity of illustration.

Only processors and I/O bus interface units initiate
transfers on the memory bus 103. The memory subsystem itself
is merely a passive entity, responding to data transfer
requests. Therefore, the memory subsystem does not have
any address bus request lines running to ASW 212.

The data bus is arbitrated separately from the address bus
(although not entirely independently, since the address bus
must be obtained first before any data transfer can take
place). Data bus arbitration is handled by the DSWs. Since
there are multiple data bus paths from memory to different
processors supporting multiple simultaneous data transfer,
global data bus arbitration is not necessary. Arbitration can
be handled at the local node level. I.e., the DSW determines
whether the local data bus 601 and the remote data bus 602
required to support a requested data transfer are available,
and if so, grants the data bus request, without regard to data
bus activity to and from other nodes.

A separate unidirectional local data bus request (DBR)
line 711A-711D (herein referred to generically as reference
number 711) runs from each processor 101 (or I/O interface
unit 105) to a respective port on the DSW 211 in its local
node. This line is used by the processor to request the data
bus (after the address bus has been granted, as explained
more fully below). The DSW grants a processor's (or I/O
interface unit's) request by signaling on a local data bus
grant (DBG) line 712A-712D (herein referred to generically
as reference number 712), there being a separate
unidirectional line 712 for each processor, and a separate port for
each such line on DSW 211. While FIG. 7 depicts local lines
711 and 712 running to four processors in local node 201, it
will be understood that local lines 711 and 712 connect to all
processors, the additional lines being omitted from the figure
for clarity of illustration.

A separate unidirectional memory data bus request (DBR)
line 721A-721B (herein referred to generically as reference
number 721) runs from each memory subsystem 203 to each
DSW 211, and a separate unidirectional memory data bus
grant (DBG) line 722A-722B (herein referred to generically
as reference number 722) runs from each DSW 211 to each
memory subsystem. Lines 721 and 722 support data bus
grant and request from the memory subsystem. Although the
memory subsystem does not initiate any data transfer
operations, it will request control of the bus to complete a
data transfer requested by another device. E.g., if a processor
requests a read of data in memory, the processor initially
obtains the address bus to make the request, but the memory
subsystem must obtain the data bus to fulfill the request once
the data is available for transfer.

Because the address and data buses are arbitrated
separately, there must be some mechanism for matching data
to requests. This is accomplished by using a tag. The tag
accompanies all commands and all data transmissions. The
tag contains the device identifier of the bus master and a
sequence number assigned by the bus master. The identifier
is not the same as a memory address. Since it only has to
identify a unique bus device, a rather small number of bits
is required. In the preferred embodiment, six bits uniquely
identify a device, three bits identifying the node, two bits
identifying the local data bus 601 to which the device is
attached, and one bit identifying the device on the local data
bus. Thus, the identifier not only uniquely identifies a
device, but also identifies its physical location. The sequence
number in the preferred embodiment is four bits. The bus
master assigns the next available sequence number with
each bus transaction it originates. When data is returned to
the bus master on a read, the bus master can match it to a
request by using the sequence number.

There are additional miscellaneous control lines which are
not shown. In particular, there are various steering signal
lines for steering data on the data bus. Because the data bus
is granted asynchronously from the address bus (as
explained in greater detail below), it is necessary to direct
the DSW where to route the data. This mechanism is
described in greater detail in commonly assigned copending
patent application Ser. No. 09/439,586, to H. Lee Blackmon,
et al., entitled "Data Routing Using Status-Response
Signals", herein incorporated by reference.

The operation of the memory bus will now be described
with reference to the above described figures and to FIGS.
8-13, representing timing diagrams of system operation. As
shown and described in FIG. 2, a global system clock is
distributed to all key components and controls the timing of
signals transmitted on the memory bus. A memory bus
transfer is initiated by either a processor 101 or an I/O bus
interface unit 105. The initiator of a bus transfer is referred
to as the bus master. The following is a very crude
summary of the steps involved in bus transfer (these steps
being more fully described herein), it being understood that
some steps may not be performed in all cases, some steps
may overlap, and some additional steps may be performed
in some cases:

(a) bus master requests and is granted the address bus
from the ASW;

(b) bus master transmits address and command information
on the address bus, which is broadcast to all devices;

(c) devices respond with status, which is collected in the
ASW and broadcast;

(d) devices respond with response, which is collected in
the ASW and broadcast;

(e) data sender (which may or may not be bus master)
requests and is granted a data bus; and

(f) data sender sends data on the data bus.

From a timing standpoint, the simplest bus activity is that
which does not involve any use of the data bus, i.e., a
command-only bus transfer. An example of a command-only
bus transfer would be a request from a processor having a
shared copy of some data in its cache to obtain exclusive
rights to the data (where cache coherency obeys an M-E-S-I
protocol). In this case, it is not always necessary to send the
data itself, but it may be necessary to inform other processors
in the event they have their own shared copies of the
data in their caches. FIG. 8 is a timing diagram of bus
activity for a command-only (i.e., no data) bus transfer.

At the top of FIG. 8, the global bus clock cycles are
shown. By convention, the receipt of bus grant by the bus
master is designated the "0" cycle. A bus transfer is initiated
by the bus master pulling ABR line 701 low for one cycle
(step 801). This is shown as cycle "-6", since it takes at least six
cycles for bus grant to be propagated back to the bus master.
ABR low is detected by the local ARP 210 (i.e., the ARP in
the same node as the bus master). After one intervening
cycle, ARP re-drives the bus request signal on RDRV_ABR
line 702 at bus cycle "-4" (step 802). RDRV_ABR low is
detected by ASW 212.

If the address bus is available, ASW 212 will grant the bus
request by driving the appropriate bits on RDRV_ABG link
703 at bus cycle "-2" (step 803). The address bus is
available if there are no other requests from the same node
in the same cycle or the previous cycle (since transmission
of address/command from the local node to the ASW
requires two cycles), and if there are no other requests from
any other requester (i.e., in a different node) in the same
cycle, since only one request can be propagated out of the
ASW per cycle. In the event of multiple conflicting requests
from different bus masters, the ASW will choose one of the
requests for bus grant and defer the other(s) to later bus
cycles. Therefore, the timing shown in FIG. 8 should be
viewed as minimum timing, as the ASW may wait additional
cycles before granting the bus. Any of various conflict
resolution schemes can be used for awarding the bus under
these circumstances, such as round-robin, prioritizing based
on type of command, random choice, etc.

The local ARP 210 decodes the signals received on
RDRV_ABG 703. If a bus grant to one of its local processors
(or I/O bus interface units) is indicated, ARP drives the
corresponding ABG line 704 low at bus cycle "0" (step 804).
This signal is detected by the originating bus master, which
now knows that it may transmit on the bus.

By convention, the bus master must transmit its
address/command/tag information on bus cycles "2" and "3" from
receipt of bus grant; the address, etc. are transmitted to ARP
210 on local address request bus 301 (step 805). As noted,
this is a half-width bus requiring two cycles to transmit all
the information. ARP 210 re-sends this information to ASW
212 on remote address request bus 302 at bus cycles "4" and
"5" (step 806).

ASW 212 loads the address command information
received in step 806 into a single wide output register for
output in a single bus cycle, and retransmits the information
on remote address broadcast bus 303 at bus cycle "7" (step
807). The same address/command/tags are transmitted on all
remote address broadcast buses 303 to all ARPs
simultaneously. In the same cycle, the same information is
transmitted to the memory subsystem on memory address
broadcast bus 305. The ARPs retransmit the address/command/tag
information to all attached processors and I/O interface units
on the local address broadcast buses 304 at bus cycle "9"
(step 808).

Each processor, I/O bus interface unit, and memory
reports its 2-bit status to ASW on send status link 401 (or
link 403, in the case of memory) at bus cycle "11" (step
809). ASW combines the status received from all devices to
generate a system status. This system status is then broadcast
from ASW to all devices on receive status link 402 (or link
404, in the case of memory) at bus cycle "13" (step 810).
Each processor and I/O bus interface unit (but not memory)
then generates and transmits its 3-bit response to ASW 212
on the three response lines of send response link 501 at bus
cycle "15" (step 811). ASW combines the responses and
broadcasts a system response on receive response link 502,
also broadcasting the global response to memory on link 504
at bus cycle "17" (step 812). This completes the
command-only transmission.

All responses are defined by the architecture and must
occur at the predefined bus cycle. In the case of responses up
to step 809, the bus cycle is fixed by the architecture. The
bus cycle at which steps 810 and 811 occur is determined by
values stored in a global system configuration register (step
812 occurs two steps after step 811). This flexibility permits
a variable latency depending on the number and
configuration of attached devices. However, once established for a
particular configuration, the value does not change, and must
be observed by all processors.

It will be observed that while the latency for a
command-only transmission is a significant number of bus cycles, each
resource is used for only one cycle at a time. (The local and
remote address request buses 301, 302, are used two cycles,
but there are multiple copies of these buses). Therefore, it is
possible to initiate a new transmission with every bus cycle
(bus masters from the same node can only initiate a new
transmission every other cycle). Thus, the maximum
possible throughput of the bus is one transmission per clock
cycle.

FIG. 9 is a timing diagram of bus activity for a
read-from-memory. The memory read transmission begins with
exactly the same steps as the command-only transmission
shown in FIG. 8 and explained above, i.e., steps 901-912 are
performed at the same bus cycles and using the same
resources as steps 801-812 explained above, the only
difference being the information that is being transmitted,
particularly the command.
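
The fixed command-phase timing shared by steps 801-812 and 901-912 can be tabulated in a short sketch. The cycle numbers are those given for FIG. 8 (minimum timing, with receipt of bus grant as cycle "0"); the names `COMMAND_PHASE`, `step_cycles` and `command_latency` are hypothetical, and the cycles shown for steps 810-812 are the illustrated values, which the text says depend on a system configuration register.

```python
# Bus cycle of each command-phase step, relative to receipt of bus grant
# (cycle 0), as described for FIG. 8. Two-cycle steps list both cycles.
COMMAND_PHASE = {
    801: (-6,),   # bus master pulls ABR line 701 low
    802: (-4,),   # ARP re-drives the request on RDRV_ABR line 702
    803: (-2,),   # ASW drives the grant on RDRV_ABG link 703
    804: (0,),    # ARP drives local ABG line 704 low
    805: (2, 3),  # master sends address/command/tag on bus 301
    806: (4, 5),  # ARP re-sends on bus 302
    807: (7,),    # ASW broadcasts on buses 303 and 305
    808: (9,),    # ARPs rebroadcast on buses 304
    809: (11,),   # devices report status
    810: (13,),   # ASW broadcasts system status (configurable cycle)
    811: (15,),   # devices send response (configurable cycle)
    812: (17,),   # ASW broadcasts system response
}


def step_cycles(step, request_cycle=0):
    """Absolute cycle(s) of a step when ABR is pulled low at request_cycle.

    Assumes minimum timing, i.e. the ASW grants the bus without delay.
    """
    offset = request_cycle - COMMAND_PHASE[801][0]
    return tuple(cycle + offset for cycle in COMMAND_PHASE[step])


def command_latency():
    """Cycles from bus request (step 801) to system response (step 812)."""
    return COMMAND_PHASE[812][0] - COMMAND_PHASE[801][0]
```

For a read-from-memory, the same table applies with each step number increased by 100.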

A command to read memory necessarily requires the 
memory interface to go out and get data from physical 
memory cells, i.e., to decode an address, charge an address 
line, etc. The latency for all this memory subsystem activity 
is N bus cycles, which is represented as step 913. I.e., step 
913 is not bus activity, but memory subsystem activity 
which begins after bus cycle "7", when the memory sub- 
system receives an address and command from the ASW on 
memory address broadcast bus 305. The bus architecture is 
not dependent on any particular memory technology, and 
thus the number N may vary. However, if N were less than 
the 10 cycles required to complete steps 909-912, it would 
be necessary for the memory subsystem to wait until these 
steps were completed. 

When the memory subsystem is ready to transmit the 
requested data, it requests the data bus from DSW 211 by 
holding the memory data bus request line 721 low at bus 
cycle "N+4" (step 914). Since data is only transmitted to the 
bus master, which is known to the memory subsystem from 
the tag data, the memory subsystem only signals on the DBR 
line 721 going to the DSW for the bus master's local node. 
An additional steering signal (not shown) is required to 
request a particular local data bus within the bus master's 
local node. Since data is read from one memory subsystem 
(but not both), only one of the DBR lines 721 going into the 
DSW for the bus master's local node is held low in response 
to the read data request. Other lines and buses may concur- 
rently be used for other data transfers. 

If the data bus is available, the DSW 211 responds by 
holding the memory data bus grant line 722 low at bus cycle
"N+6" (step 915). Only the DBG line 722 going to the
requesting memory subsystem is held low. The data bus is 
available if both the local data bus 601 going between the 
DSW and the bus master, and the remote data bus 602 going 
between DSW and the memory subsystem requesting bus 
grant, will be available at the time of transmission. 

The memory subsystem then transmits the data along with 
the tags to DSW on remote data bus 602, beginning at bus 
cycle "N+8" (step 916). Sixteen bytes of data are transmitted 
with each bus cycle, the data bus being 16 bytes wide. The 
tag is repeated with each bus cycle. A single transmission 
can last a maximum of eight cycles (maximum of 128 bytes 
of data). To transmit more data, the bus master must make 
multiple requests for the bus. While transmitting data, the 
sender holds a data valid (DVal) line low. The sender also 
holds a data busy (DBusy) line low, releasing it one cycle 
before end of transmission to let the receiver know that 
transmission is about to end. 
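
The transfer-size limits above imply a simple calculation of how many bus transmissions a given amount of data needs. The following is a minimal sketch; the function name `plan_transfer` is hypothetical, and it assumes that a final partial 16-byte unit still occupies one full bus cycle, which the text does not state.

```python
BYTES_PER_CYCLE = 16       # the data bus is 16 bytes wide
MAX_CYCLES_PER_GRANT = 8   # one transmission lasts at most eight cycles


def plan_transfer(nbytes):
    """Return the number of data cycles used by each bus transmission."""
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    # Ceiling division: assume a partial final unit still takes one cycle.
    total_cycles = -(-nbytes // BYTES_PER_CYCLE)
    full, rest = divmod(total_cycles, MAX_CYCLES_PER_GRANT)
    return [MAX_CYCLES_PER_GRANT] * full + ([rest] if rest else [])
```

Each entry in the returned list corresponds to one separate request for the data bus.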

DSW 211 retransmits the data received from the memory 
subsystem to the bus master on the local data bus 601 
servicing the bus master, beginning at bus cycle "N+10" 
(step 917). I.e., data is retransmitted on local data bus 601 
two bus cycles after transmission on remote data bus 602. 
The tag is also repeated each cycle, and DVal and DBusy are 
held low as before. This completes the memory read trans- 
mission. 
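
The tag repeated on each data cycle is what lets the bus master match returned data to its request. As described earlier, it consists of a six-bit device identifier (three node bits, two local-data-bus bits, one device bit) and a four-bit sequence number. The sketch below packs and unpacks such a tag; the names `Tag`, `pack_tag` and `unpack_tag` are hypothetical, and the bit ordering within the 10-bit word is an assumption, since the text gives only the field widths.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    node: int       # 3 bits: local node of the bus master
    local_bus: int  # 2 bits: local data bus 601 within the node
    device: int     # 1 bit: device on that local data bus
    seq: int        # 4 bits: sequence number assigned by the bus master


def pack_tag(tag):
    """Pack a Tag into 10 bits.

    Assumed layout, most significant first: node | local_bus | device | seq.
    """
    if not (0 <= tag.node < 8 and 0 <= tag.local_bus < 4
            and 0 <= tag.device < 2 and 0 <= tag.seq < 16):
        raise ValueError("tag field out of range")
    return (tag.node << 7) | (tag.local_bus << 5) | (tag.device << 4) | tag.seq


def unpack_tag(word):
    """Inverse of pack_tag under the same assumed layout."""
    return Tag((word >> 7) & 0x7, (word >> 5) & 0x3, (word >> 4) & 0x1,
               word & 0xF)
```

Because the identifier encodes node and local data bus, a receiver can derive the physical route back to the bus master directly from the tag fields.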

FIG. 10 is a timing diagram of bus activity for a
write-to-memory from the bus master (either a processor 101 or
I/O bus interface unit 105, in which case data is coming from
some I/O unit, such as a storage device). The memory write
transmission begins with exactly the same steps as the
command-only transmission shown in FIG. 8 and explained
above, i.e., steps 1001-1012 are performed at the same bus
cycles and using the same resources as steps 801-812
explained above, the only difference being the information
being transmitted, particularly the command.

When the bus master receives the bus grant at cycle "0"
(step 1004), it knows that it will be able to begin transmitting
the address and command at bus cycle "2", and that the
memory subsystem will receive this information at bus cycle
"7", these being predefined by the bus architecture. Since
there is at least this much latency for data to reach the
memory subsystem, it can immediately request the data bus.
The bus master requests the data bus by holding local data
bus request (DBR) line 711 low at bus cycle "2" (step 1013).
If the data bus is available, the DSW 211 responds by
holding the local data bus grant (DBG) line 712 for the bus
master low at bus cycle "4" (step 1014). The data bus is
available if both the local data bus 601 going between the
bus master and the DSW, and the remote data bus 602 going
between the DSW and the memory subsystem where the
data will be stored, will be available at the time of
transmission.

The bus master then transmits the data along with the tags
to DSW 211 on local data bus 601, beginning at bus cycle
"6" (step 1015). Sixteen bytes of data are transmitted with
each bus cycle, the tag being repeated with each bus cycle.
As in the case of a read, data transmission is always limited
to eight cycles (maximum of 128 bytes of data). While
transmitting data, the bus master holds a data valid (DVal)
line low. The bus master also holds a data busy (DBusy) line
low, releasing it one cycle before end of transmission to let
the DSW know that transmission is about to end.

DSW 211 retransmits the data received from the bus
master to the memory subsystem on the remote data bus 602
which runs to the memory subsystem where the data will be
stored, beginning at bus cycle "8" (step 1016). Thus, the first
16 bytes of data reach the memory subsystem one bus cycle
after the address and command arrive. The tag is also
repeated each cycle, and DVal and DBusy are held low as
before. This completes the memory write transmission.
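
The write timing can be checked with a few lines of arithmetic. This is a minimal sketch under the minimum-timing assumption (every request is granted immediately); the names `WRITE_STEPS`, `ADDRESS_AT_MEMORY`, `DSW_DELAY` and `write_data_cycles` are hypothetical.

```python
ADDRESS_AT_MEMORY = 7  # cycle at which memory receives address/command
DSW_DELAY = 2          # cycles the DSW holds data between buses

# Cycle of each data-phase step of a write, relative to bus grant (cycle 0).
WRITE_STEPS = {
    1013: 2,  # bus master holds DBR line 711 low
    1014: 4,  # DSW holds DBG line 712 low
    1015: 6,  # first data cycle on local data bus 601
    1016: 8,  # first data cycle on remote data bus 602
}


def write_data_cycles(nbytes):
    """Return (first, last) cycles at which data reaches the memory subsystem."""
    cycles = -(-nbytes // 16)  # 16 bytes per bus cycle, rounded up
    if not 1 <= cycles <= 8:
        raise ValueError("a single transmission carries 1 to 128 bytes")
    first = WRITE_STEPS[1015] + DSW_DELAY
    return first, first + cycles - 1
```

With these values the first data cycle reaches memory at cycle 8, one cycle after the address and command, as stated above.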
The memory bus architecture also supports data transfer
between processors 101 and/or I/O bus interface units 105.
There are three flavors of such transfer, each with its own
timing. In one case, the two devices are in the same local
node 201, but not on the same local data bus 601. In a second
case, the two devices are in the same local node and are on
the same local data bus 601. In the third case, the two
devices are in different local nodes. Such a data transfer
might occur, for example, as a result of an attempt to read
from memory, where another processor has a more current
version of the data to be read in its cache, or might be a read
of data from an I/O device.

FIG. 11 is a timing diagram of bus activity for a
read-from-device, where the device which is the source of the
data is in the same local node as the bus master, but is not
on the same local data bus 601 (referred to as a read with
local intervention). In the computer system of the preferred
embodiment, this type of operation occurs when the bus
master attempts to read from memory, but another device
(processor) intervenes because it has a more current copy of
the data in its cache.

Not surprisingly, the read-from-device begins in exactly
the same manner as a read-from-memory, and from the
perspective of the bus master, it is a generic read. I.e., steps
1101-1112 are performed at the same bus cycles and using
the same resources as steps 901-912 (and 801-812),
explained above. I.e., the bus master sends out only an
address, and does not necessarily know where the data is
located. After all the units have responded, and status and
response have been broadcast by ASW 212, the devices will
know whether the data resides in memory, or some more
current copy of the data is in another processor. In particular,
the source device will know it is the source device (although
other devices will not necessarily know which is the source
device) because the broadcast status/response having
highest priority came from it.

After receiving status, the data source device requests the
data bus by holding its local data bus request (DBR) line 711
low (step 1113). This is shown in FIG. 11 at bus cycle "18",
it being understood that this is the earliest such an event may
occur, and that the data source device may be busy with
other data transfers and unable to request the bus
immediately.

If the data bus is available, the DSW 211 responds by
holding the corresponding local data bus grant (DBG) line
712 low at bus cycle "20" (step 1114). The data bus is
available if all three of the following will be available at the
time of transmission: the local data bus 601 going between
the data source device and the DSW, the local data bus 601
going between the DSW and the bus master, and the remote
data bus 602 going between the DSW and the memory
subsystem in which the data resides. The reason all three
must be available is that data in memory will be updated at
the same time.

The source device then transmits the data along with the
tags to DSW on its local data bus 601, beginning at bus cycle
"22" (step 1115). Sixteen bytes of data and the tag are
transmitted with each bus cycle. While transmitting data, the
source device holds the data valid (DVal) and data busy
(DBusy) lines low, releasing DBusy one cycle before
transmission end.

DSW 211 retransmits the data received from the data
source device to the bus master on the local data bus 601
servicing the bus master, beginning at bus cycle "24" (step
1116). I.e., data is retransmitted on local data bus 601 going
to the bus master two bus cycles after transmission is
received on local data bus 601 from the data source. The tag
is also repeated each cycle, and DVal and DBusy are held
low as before. On exactly the same bus cycles, the DSW 211
retransmits the same data and tags to the memory subsystem
in which the unmodified data resides so that the memory
subsystem may update the data. This completes the
transmission.

FIG. 12 is a timing diagram of bus activity for a
read-from-device, where the device which is the source of the
data is in a different local node (referred to as remote
intervention). There is no data bus running directly between
DSWs in different local nodes. In order to perform this type
of a data transfer, data is sent to the memory controller and
retransmitted, as described below. Depending on the type of
transmission, the memory controller may or may not store
the data in memory after it receives the data; however, from
the standpoint of bus timing, this activity is irrelevant,
because the memory controller does not wait for the data to
be stored in the memory subsystem before retransmitting it
to the bus master. The read with remote intervention
transmission begins with exactly the same steps as the
command-only transmission shown in FIG. 8 and explained
above, i.e., [...]

[...] the data bus is available, the DSW 211 responds by holding
the corresponding local data bus grant (DBG) line 712 low
at bus cycle "20" (step 1214). The data bus is available if
both the local data bus 601 going between the data source
device and the DSW, and the remote data bus 602 going
between the DSW and the memory subsystem on which the
data resides, will be available at the time of transmission. In
some cases, e.g., a transfer from an I/O device to a processor,
the data is not stored in memory, the memory subsystem
acting only as a data switch and buffer. In such cases it is
possible to send the data to either memory subsystem. As
long as the data bus to either subsystem is available, the data
will be sent. If both buses are available, a priority scheme
will determine which bus to use.

The source device then transmits the data along with the
tags to its local DSW on its local data bus 601, beginning at
bus cycle "22" (step 1215). Sixteen bytes of data and the tag
are transmitted with each bus cycle. While transmitting data,
the source device holds the data valid (DVal) and data busy
(DBusy) lines low, releasing DBusy one cycle before
transmission end. Up to this point, the transfer is essentially the
same as the local intervention data transfer described with
respect to FIG. 11.

The DSW 211 local to the data source retransmits the data
received to one of the memory subsystems on remote data
bus 602, beginning at bus cycle "24" (step 1216). I.e., data
is retransmitted on remote data bus 602 to the memory
subsystem two bus cycles after transmission is received on
local data bus 601 from the data source. The tag is also
repeated each cycle, and DVal and DBusy are held low as
before.

The data received is buffered in the memory subsystem
for transmission to the DSW 211 local to the bus master.
Ideally, this retransmission begins after a 3-cycle latency.
However, the 3-cycle latency is a minimum latency. The
memory subsystem must first obtain the bus going to the bus
master, and if this bus is busy the latency could be longer.
The memory subsystem requests the data bus to the bus
master by lowering its DBR line 721 to the DSW 211 local
to the bus master, depicted in FIG. 12 at cycle "24", which
is the earliest it may occur (step 1217). If the data bus is
available, the DSW grants the bus by lowering the DBG line
722 to the memory subsystem (step 1218). The data bus is
available if both the remote data bus 602 running between
the memory subsystem and the DSW 211 local to the bus
master is available and the local data bus 601 running
between the DSW 211 and the bus master is available.

After receiving data bus grant, the memory subsystem
transmits the data held in its buffer along with the tags to the
DSW 211 local to the bus master on remote data bus 602,
beginning at bus cycle "28" (step 1219). As in the case of
other data transmissions, the memory subsystem holds DVal
and DBusy low, releasing DBusy one cycle before end of
transmission.

DSW 211 retransmits the data received from the memory
subsystem to the bus master on the local data bus 601 which
services the bus master, beginning at bus cycle "30" (step
1220), also repeating the tag and holding DVal and DBusy
low as before. This completes the read with remote inter-
steps 1201-1212 are performed at the same bus cycles and 60 vention. 

using the same resources as steps 801-812 explained above, FIG. 13 is a timing diagram of bus activity for a read- 
me only difference being the information that is being from-device, where the device which is the source of the 
transmitted, particularly the command. data is not only in the same local node as the bus master, but 
After status has been broadcast, the devices will know is also on the same local data bus 601. This type of operation 
which one is to send the requested data. The data source 65 is referred to as a local echo, and only occurs between two 
device therefore requests the data bus by holding its local processors 101. The device read transmission begins with 
DBR line 711 low at or after bus cycle "18" (step 1213). If exactly the same steps as the read with local intervention 
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shown in FIG. 11 and explained above, i.e., steps 1301-13i4 
are performed at the same bus cycles and using the same 
resources as steps 1101-1114 explained above. 

In the read with local intervention described above with 
respect to FIG. 11, the data comes into the DSW on one local 
data bus 601 and goes out on another, making it possible to 
use both buses in an overlapping manner, retransmitting data 
two cycles after initial transmission. This is not possible in 
the case of the local echo, because only one local data bus 
601 is available for use. Although this is a bidirectional bus 
serving two processors, the bus architecture is designed so 
that each processor will ignore any transmission from the 
other. A separate DVal line runs from DSW 211 to each 
processor. When the source lowers its DVal line, this is seen 
only by DSW 211, and not the other processor, and hence the 
data being transmitted by the source is initially ignored. The 
DSW must retransmit the data and lower DVal so that the 
bus master will receive it. 
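
The DVal gating described above can be illustrated with a small model. This is only a sketch, assuming one DVal line per processor that is sampled on every bus cycle; the names `Processor`, `sample` and `local_echo` are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical model of a processor on a shared local data bus."""
    name: str
    received: list = field(default_factory=list)

    def sample(self, bus_value, dval_low):
        # The processor latches the bus only while the DVal line that
        # runs between the DSW and this processor is driven low by the DSW.
        if dval_low:
            self.received.append(bus_value)


def local_echo(source, master, beats):
    """Move `beats` from source to master through the DSW on one bus."""
    dsw_buffer = []
    # Phase 1: the source drives the bus and lowers its own DVal line,
    # which only the DSW sees. The master's DVal line stays high.
    for beat in beats:
        dsw_buffer.append(beat)
        master.sample(beat, dval_low=False)
    # Phase 2: the DSW drives the same bus and lowers the master's DVal.
    for beat in dsw_buffer:
        master.sample(beat, dval_low=True)
        source.sample(beat, dval_low=False)
    return dsw_buffer
```

In this model the bus master ends up with exactly one copy of the data, and the source never latches the echoed copy.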

As shown in FIG. 13, the data source begins transmission 
of the data with tag on local data bus 601 at bus cycle "22", 
and in the example case of an 8-cycle transmission (eight 
cycles being the maximum permitted in the architecture), 
completes transmission on bus cycle "29" (step 1315). In 
this transmission, the data source device lowers DVal and 
DBusy going to the DSW. 
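
As a rough illustration of the signalling during such a transmission, the sketch below lists the state of DVal and DBusy for each bus cycle. It assumes that "releasing DBusy one cycle before transmission end" means DBusy is already high during the final data cycle; that reading, and the function name `data_phase_signals`, are assumptions rather than statements from the specification.

```python
def data_phase_signals(start=22, beats=8):
    """Return {cycle: (dval_low, dbusy_low)} for one data transmission."""
    if not 1 <= beats <= 8:
        raise ValueError("the architecture permits at most eight cycles")
    end = start + beats - 1
    signals = {}
    for cycle in range(start, end + 1):
        dval_low = True  # DVal is held low for every data cycle
        # Assumption: DBusy is released (high) on the last data cycle.
        dbusy_low = cycle < end
        signals[cycle] = (dval_low, dbusy_low)
    return signals
```

With the default arguments the result covers the eight cycles of the example, "22" through "29".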

The DSW must wait until all the data is transmitted by the 
data source before it can use the same local data bus 601. 
Beginning with bus cycle "31", it retransmits the data 
received with the tag to the bus master on local data bus 601 
(step 1316). In this transmission, the DSW lowers DVal and 
DBusy going to the bus master. 

Since the updated data from a processor's cache should be 
simultaneously updated in memory, DSW 211 concurrently 
transfers the data to the appropriate memory subsystem on 
remote data bus 602 (step 1317). Note that this transfer 
begins at bus cycle "24". I.e., it is not necessary to wait until 
all the data from the source device is received, because it is 
being retransmitted to memory on a physically separate bus 
connection. 
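
The difference in timing between the local intervention and the local echo can be summarized in a short sketch. The two-cycle forwarding delay (`HOP`) is read off the examples, and the rule that the DSW restarts two cycles after the source's last cycle is an assumption that generalizes the single 29-to-31 example; all function and variable names are hypothetical.

```python
HOP = 2  # assumed DSW forwarding delay in bus cycles (22 -> 24 in the examples)


def local_intervention(start=22, beats=8):
    """Source and bus master sit on different local data buses."""
    first = start + HOP
    return {
        "source": (start, start + beats - 1),
        "to_master": (first, first + beats - 1),
        "to_memory": (first, first + beats - 1),
    }


def local_echo(start=22, beats=8):
    """Source and bus master share one local data bus."""
    source_end = start + beats - 1
    # Assumption: the DSW restarts HOP cycles after the source's last
    # cycle, generalizing the 29 -> 31 example.
    master_first = source_end + HOP
    memory_first = start + HOP  # separate remote bus, so no waiting
    return {
        "source": (start, source_end),
        "to_master": (master_first, master_first + beats - 1),
        "to_memory": (memory_first, memory_first + beats - 1),
    }


# First cycle at which the bus master sees data in each example; the
# remote intervention value is the cycle quoted for step 1220.
FIRST_DATA_AT_MASTER = {
    "local_intervention": local_intervention()["to_master"][0],
    "remote_intervention": 30,
    "local_echo": local_echo()["to_master"][0],
}
```

The comparison shows why the local echo is the slowest of the three read-from-device cases for a full-length transfer: the shared bus cannot be used in an overlapping manner.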

FIG. 14 is a timing diagram of bus activity for a write- 
to-device. This is used, e.g., to write from a processor to an 
I/O interface unit, and always occurs across nodes. Like the 
remote intervention transaction, a write-to-device is accom- 
plished by sending the data first to a memory subsystem 
controller, which then re-transmits it to the destination 
device (without storing the data in the memory subsystem). 
The write-to-device therefore begins with exactly the same
steps as the write-to-memory shown in FIG. 10 and
explained above, i.e., steps 1401-1416 are performed at the 
same bus cycles and using the same resources as steps 
1001-1016 explained above. 

After status/response have been broadcast, the memory 
subsystem will know where to send the requested data. The 
memory subsystem therefore requests the data bus going to 
the local node of the destination device by holding the 
appropriate DBR line 721 low at or after bus cycle "18" 
(step 1417). If the data bus is available, the DSW 211 
responds by holding the corresponding remote data bus 
grant (DBG) line 722 low at bus cycle "20" (step 1418). The 
data bus is available if both the remote data bus 602 going 
between the memory subsystem and the DSW servicing the 
destination device, and the local data bus 601 going between 
the DSW and the destination device, will be available at the 
time of transmission. 

The memory subsystem then re-transmits the data along 
with the tags to the DSW on remote data bus 602, beginning 
at bus cycle "22" (step 1419). While transmitting data, the
memory subsystem holds the data valid (DVal) and data busy
(DBusy) lines low, releasing DBusy one cycle before
transmission end. The DSW 211 local to the destination device
retransmits the data received to the destination device on 
local data bus 601, beginning at bus cycle "24" (step 1420). 
The tag is also repeated each cycle, and DVal and DBusy are 
held low as before. This completes the write-to-device 
transaction. 
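
The data phase of the write-to-device (steps 1417-1420) can be sketched as a list of events. This is an illustration only: it assumes that the two-cycle spacing of the example (cycles 18, 20, 22, 24) also holds when the request is made later than cycle 18, and the function name is hypothetical.

```python
def write_to_device_data_phase(dbr_cycle=18, remote_bus_free=True,
                               local_bus_free=True):
    """Return (cycle, step, event) tuples, or None if no grant is given."""
    if dbr_cycle < 18:
        raise ValueError("DBR may be lowered at or after bus cycle 18")
    # The DSW grants only if both hops to the destination are available.
    if not (remote_bus_free and local_bus_free):
        return None
    # Assumption: the 2-cycle spacing of the example is kept for later
    # requests as well.
    return [
        (dbr_cycle, 1417, "memory subsystem lowers DBR line 721"),
        (dbr_cycle + 2, 1418, "DSW lowers DBG line 722"),
        (dbr_cycle + 4, 1419, "memory subsystem sends data on remote data bus 602"),
        (dbr_cycle + 6, 1420, "DSW resends data on local data bus 601"),
    ]
```

Returning `None` models the case in which the grant is withheld because either the remote data bus 602 or the local data bus 601 would be busy at the time of transmission.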

In the design of processor chips, it is well known that the 
number of available I/O pins is a severe constraint. The use
of a narrower bus for transmitting commands/addresses/tags 
reduces the number of I/O pins required of a processor chip. 
FIG. 15 illustrates a processor chip and its I/O pin require- 
ments at a high level. It will be understood that FIG. 15 is 
a greatly simplified diagram, and that a typical processor 
will have many components, such as on-board caches, 
registers, ALUs, instruction decode logic, instruction 
sequencer, etc., which are not shown. In the preferred 
embodiment, the chip has approximately 1000 available 
pins. The processor chip has an on-board L1 cache, but the
L2 cache is on a separate chip. Approximately half of the 
available pins are used by the L2 cache interface 1505. 
Many remaining pins are required for power, clock, testing, 
configuration inputs, etc. This leaves a limited number of 
pins for the processor/memory bus interface 1506. The data 
bus portion includes data bus request pin 1510, data bus 
grant pin 1511, the bidirectional data bus itself 1512, and 
miscellaneous control/steering pins not shown. The address/ 
command bus portion includes command bus request pin
1501 (for driving a request line 701), command bus grant pin
1502 (for receiving a grant line 704), command output pins
1503 (for driving link 301) and command input pins 1504
(for receiving link 304). It also includes miscellaneous 
status, response and other pins collectively designated 1520. 
According to the preferred embodiment of the present 
invention, the number of command output pins 1503 
required is approximately half the number of command
input pins 1504. The number of pins shown in FIG. 15
is intended as a conceptual representation only; the actual 
number is much greater and would be difficult to illustrate in 
a single drawing. 

Thus, the bus architecture described herein facilitates connection
to a relatively large number of processors and I/O
interface units. In the preferred embodiment, there is an 
architectural limit of approximately 37 devices, this limit 
being imposed largely by the number of available I/O pins 
in the DSW, ASW and ARP chips. However, it will readily 
be appreciated by those skilled in the art that the architec- 
tural concepts explained herein could easily be employed to 
support a larger number of devices, e.g., by using additional
switching chips. 

In the preferred embodiment, a particular timing sequence 
is presented. It will readily be understood by those skilled in 
the art that the timing sequences presented herein are 
exemplary in nature, and that the architectural concepts 
explained herein could use different timing sequences, pro- 
vided the timing sequence protocol is observed by all bus 
devices. It will further be understood that in some cases 
there may be a variable delay between the occurrence of 
certain events which require separate arbitration, and that the 
timing diagrams presented herein show only one set of 
examples of such delays. 

In the preferred embodiment, certain status and response 
signals are transmitted, these being divided in two different 
bus cycles. It would alternatively be possible to combine 
these responses in a single bus cycle, to transmit other 
information as part of status/response, to allocate additional
bus cycles for status/response, etc.